{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "OQxl99l0bZac"
      },
      "source": [
        "##### Copyright 2022 The TensorFlow Authors."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "YHz2D-oIqBWa"
      },
      "outputs": [],
      "source": [
        "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "# https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "jXslvcRocA-0"
      },
      "source": [
        "# Private Heavy Hitters"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "0XBJJIqwcXKd"
      },
      "source": [
        "\u003ctable class=\"tfo-notebook-buttons\" align=\"left\"\u003e\n",
        "  \u003ctd\u003e\n",
        "    \u003ca target=\"_blank\" href=\"https://www.tensorflow.org/federated/tutorials/private_heavy_hitters\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/tf_logo_32px.png\" /\u003eView on TensorFlow.org\u003c/a\u003e\n",
        "  \u003c/td\u003e\n",
        "  \u003ctd\u003e\n",
        "    \u003ca target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/federated/blob/v0.84.0/docs/tutorials/private_heavy_hitters.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" /\u003eRun in Google Colab\u003c/a\u003e\n",
        "  \u003c/td\u003e\n",
        "  \u003ctd\u003e\n",
        "    \u003ca target=\"_blank\" href=\"https://github.com/tensorflow/federated/blob/v0.84.0/docs/tutorials/private_heavy_hitters.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" /\u003eView source on GitHub\u003c/a\u003e\n",
        "  \u003c/td\u003e\n",
        "  \u003ctd\u003e\n",
        "    \u003ca href=\"https://storage.googleapis.com/tensorflow_docs/federated/docs/tutorials/private_heavy_hitters.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/download_logo_32px.png\" /\u003eDownload notebook\u003c/a\u003e\n",
        "  \u003c/td\u003e\n",
        "\u003c/table\u003e"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "JJqFp24bb2JN"
      },
      "source": [
        "**NOTE**: This colab has been verified to work with the [latest released version](https://github.com/tensorflow/federated#compatibility) of the `tensorflow_federated` pip package. This colab may not be updated to work against `main`.\n",
        "\n",
        "This tutorial shows how to use the `tff.analytics.heavy_hitters.iblt.build_iblt_computation` API to build a federated analytics computation to discover the most frequent strings (private heavy hitters) in the population."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "MnUwFbCAKB2r"
      },
      "source": [
        "## Environment Setup\n",
        "\n",
        "Please run the following to make sure that your environment is\n",
        "correctly setup. If you don't see a greeting, please refer to the\n",
        "[Installation](../install.md) guide for instructions. "
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ZrGitA_KnRO0"
      },
      "outputs": [],
      "source": [
        "#@test {\"skip\": true}\n",
        "\n",
        "# tensorflow_federated_nightly also bring in tf_nightly, which\n",
        "# can causes a duplicate tensorboard install, leading to errors.\n",
        "!pip install --quiet tensorflow-text-nightly\n",
        "!pip install --quiet --upgrade tensorflow-federated"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "8BKyHkMxKHfV"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "b'Hello, World!'"
            ]
          },
          "execution_count": 1,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "import collections\n",
        "\n",
        "import numpy as np\n",
        "import tensorflow as tf\n",
        "import tensorflow_federated as tff\n",
        "import tensorflow_text as tf_text\n",
        "\n",
        "np.random.seed(0)\n",
        "tff.backends.test.set_sync_test_cpp_execution_context()\n",
        "\n",
        "tff.federated_computation(lambda: 'Hello, World!')()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "BhLs5GNQ-wWu"
      },
      "source": [
        "## Background: Private Heavy Hitters in Federated Analytics"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "XgGacXm1mVE3"
      },
      "source": [
        "Consider the following setting: Each client has a list of strings, and each string is from an open set, which means it could be arbitrary. The goal is to discover the most popular strings (**heavy hitters**) and their counts privately in a federated setting. This colab demonstrates a solution to this problem with the following privacy properties:\n",
        "\n",
        "*   Secure aggregation: Computes the aggregated string counts such that it should not be possible for the server to learn any client's individual value. See `tff.federated_secure_sum` for more information.\n",
        "*   Differential pirvacy (DP): A widely used method for bounding and quantifying the privacy leakage of sensitive data in analytics. You can apply user-level central DP to the heavy hitter results. \n",
        "\n",
        "The secure aggregation API `tff.federated_secure_sum` supports linear sums of integer vectors. If the strings are from a closed set of size `n`, then it is easy to encode each client's strings to a vector of size `n`: let the value at index `i` of the vector be the count of the `i`\u003csup\u003eth\u003c/sup\u003e string in the closed set. Then you can securely sum the vectors of all clients to get the counts of strings in the whole population. However, if the strings are from an open set, it is not obvious how to encode them properly for secure sum. In this work, you can encode the strings into [Invertible Bloom Lookup Tables (IBLT)](https://arxiv.org/abs/1101.2245), which is a probabilistic data structure that have the ability to encode items in a large (or open) domain in an efficient manner. IBLT sketches can be linearly summed, so they are compatible with secure sum.\n",
        "\n",
        "You can use `tff.analytics.heavy_hitters.iblt.build_iblt_computation` to create a TFF computation that encodes each client's local strings to an IBLT structure. These structures are securely summed via a cryptographic secure multi-party computation protocol into an aggregated IBLT structure which the server can decode. The server then can return the top heavy hitters. The following sections show how to use this API to create a TFF computation and run simulations with the Shakespeare dataset. "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "CFY_3z-x-3r6"
      },
      "source": [
        "## Load and Preprocess the Federated Shakespeare Data"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "0O1CHhdDJcij"
      },
      "source": [
        "The Shakespeare dataset contains lines of characters of Shakespeare plays. In this example, a subset of characters (that is, clients) are selected.  A preprocessor converts each character's lines into a list of strings, and any string that is only punctuation or symbols is dropped."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "b65q5mp4r1n7"
      },
      "outputs": [],
      "source": [
        "# Load the simulation data.\n",
        "source, _ = tff.simulation.datasets.shakespeare.load_data()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ReoTRs8ntJw7"
      },
      "outputs": [],
      "source": [
        "# Preprocessing function to tokenize a line into words.\n",
        "def tokenize(ds):\n",
        "  \"\"\"Tokenizes a line into words with alphanum characters.\"\"\"\n",
        "  def extract_strings(example):\n",
        "    return tf.expand_dims(example['snippets'], 0)\n",
        "\n",
        "  def tokenize_line(line):\n",
        "    return tf.data.Dataset.from_tensor_slices(tokenizer.tokenize(line)[0])\n",
        "\n",
        "  def mask_all_symbolic_words(word):\n",
        "    return tf.math.logical_not(\n",
        "        tf_text.wordshape(word, tf_text.WordShape.IS_PUNCT_OR_SYMBOL))\n",
        "\n",
        "  tokenizer = tf_text.WhitespaceTokenizer()\n",
        "  ds = ds.map(extract_strings)\n",
        "  ds = ds.flat_map(tokenize_line)\n",
        "  ds = ds.map(tf_text.case_fold_utf8)\n",
        "  ds = ds.filter(mask_all_symbolic_words)\n",
        "  return ds\n",
        "\n",
        "batch_size = 5\n",
        "\n",
        "def client_data(n: int) -\u003e tf.data.Dataset:\n",
        "  return tokenize(source.create_tf_dataset_for_client(\n",
        "      source.client_ids[n])).batch(batch_size)\n",
        "\n",
        "# Pick a subset of client devices to participate in the computation.\n",
        "dataset = [client_data(n) for n in range(10)]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "iDGwJsssK9_e"
      },
      "source": [
        "## Simulations"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ZtCRYhI0nKcm"
      },
      "source": [
        "To run simulations to discover the most popular words (heavy hitters) in the Shakespeare dataset, first you need to create a TFF computation using the `tff.analytics.heavy_hitters.iblt.build_iblt_computation` API with the following parameters:\n",
        "\n",
        "* `capacity`: The capacity of the IBLT sketch. This number should be roughly the total number of unique strings that could appear in one round of computation. Defaults to `1000`. If this number is too small, the decoding could fail due to collision of hashed values. If this number is too large, it would consume more memory than necessary.\n",
        "* `string_max_bytes`: The maximum length of a string in the IBLT. Defaults to `10`. Must be positive. The strings longer than `string_max_bytes` will be truncated.\n",
        "* `max_words_per_user`: The maximum number of strings each client is allowed to contribute. If not `None`, must be a positive integer. Defaults to `None`, which means all the clients contribute all their strings.\n",
        "* `max_heavy_hitters`: The maximum number of items to return. If the decoded results have more than this number of items, will order decreasingly by the estimated counts and return the top max_heavy_hitters items. Defaults to `None`, which means to return all the heavy hitters in the result.\n",
        "* `secure_sum_bitwidth`: The bitwidth used for secure sum. The default value is\n",
        "`None`, which disables secure sum. If not `None`, must be in the range `[1,62]`. See `tff.federated_secure_sum`.\n",
        "* `multi_contribution`: Whether each client is allowed to contribute multiple counts or only a count of one for each unique word. Defaults to `True`. This argument could improve the utility when differential privacy is required.\n",
        "* `batch_size`: The number of elements in each batch of the dataset. Defaults to `1`, means the input dataset is processed by `tf.data.Dataset.batch(1)`. Must be a positive integer.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "9iyRWmV529qY"
      },
      "outputs": [],
      "source": [
        "max_words_per_user = 8\n",
        "iblt_computation = tff.analytics.heavy_hitters.iblt.build_iblt_computation(\n",
        "    capacity=100,\n",
        "    string_max_bytes=20,\n",
        "    max_words_per_user=max_words_per_user,\n",
        "    max_heavy_hitters=10,\n",
        "    secure_sum_bitwidth=32,\n",
        "    multi_contribution=False,\n",
        "    batch_size=batch_size)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Qe8ZUIwH4C1y"
      },
      "source": [
        "Now you are ready to run simulations with TFF computation `iblt_computation` and the preprocess input dataset. The output `iblt_computation` of has four attributes:\n",
        "\n",
        "*   clients: A scalar number of clients that participated in the computation.\n",
        "*   heavy_hitters: A list of aggregated heavy hitters.\n",
        "*   heavy_hitters_counts: A list of the counts of aggregated heavy hitters.\n",
        "*   num_not_decoded: A scalar number of strings that are not successfully decoded.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "5r8Y6GL-zhPv"
      },
      "outputs": [],
      "source": [
        "def run_simulation(one_round_computation: tff.Computation, dataset):\n",
        "  output = one_round_computation(dataset)\n",
        "  heavy_hitters = output.heavy_hitters\n",
        "  heavy_hitters_counts = output.heavy_hitters_counts\n",
        "  heavy_hitters = [word.decode('utf-8', 'ignore') for word in heavy_hitters]\n",
        "\n",
        "  results = {}\n",
        "  for index in range(len(heavy_hitters)):\n",
        "    results[heavy_hitters[index]] = heavy_hitters_counts[index]\n",
        "  return output.clients, dict(results)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "w99wVdhW0OIR"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Number of clients participated: 10\n",
            "Discovered heavy hitters and counts:\n",
            "{'to': 8, 'the': 8, 'and': 7, 'you': 4, 'i': 4, 'a': 3, 'he': 3, 'your': 3, 'is': 3, 'of': 2}\n"
          ]
        }
      ],
      "source": [
        "clients, result = run_simulation(iblt_computation, dataset)\n",
        "print(f'Number of clients participated: {clients}')\n",
        "print('Discovered heavy hitters and counts:')\n",
        "print(result)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "u4SdslRULCox"
      },
      "source": [
        "## Private Heavy Hitters with Differential Privacy"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "0F4O2U7nGL1A"
      },
      "source": [
        "To obtain private heavy hitters with central DP, a DP mechanism is applied for\n",
        "open set histograms. The idea is to add noise to the counts of strings in the\n",
        "aggregated histogram, then only keep the strings with counts above a certain\n",
        "threhold. The noise and threhold depends on (epsilon, delta)-DP budget, see\n",
        "[this doc](https://github.com/google/differential-privacy/blob/main/common_docs/Delta_For_Thresholding.pdf)\n",
        "for detailed algorithm and proof. The noisy counts are rounded to integers as a\n",
        "post-processing step, which does not weaken the DP guarantee. Note that you will\n",
        "discover less heavy hitters when DP is required. This is because the\n",
        "thresholding step filters out strings with low counts."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ryZZgH8nJi9v"
      },
      "outputs": [],
      "source": [
        "iblt_computation = tff.analytics.heavy_hitters.iblt.build_iblt_computation(\n",
        "    capacity=100,\n",
        "    string_max_bytes=20,\n",
        "    max_words_per_user=max_words_per_user,\n",
        "    secure_sum_bitwidth=32,\n",
        "    multi_contribution=False,\n",
        "    batch_size=batch_size)\n",
        "\n",
        "clients, result = run_simulation(iblt_computation, dataset)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "LxhBSUFs3Ku6"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Discovered heavy hitters and counts with central DP:\n",
            "{'the': 8, 'you': 4, 'to': 7, 'tear': 3, 'and': 7, 'i': 3}\n"
          ]
        }
      ],
      "source": [
        "# DP parameters\n",
        "eps = 20\n",
        "delta = 0.01\n",
        "\n",
        "# Calculating scale for Laplace noise\n",
        "scale = max_words_per_user / eps\n",
        "\n",
        "# Calculating the threshold\n",
        "tau = 1 + (max_words_per_user / eps) * np.log(max_words_per_user / (2 * delta))\n",
        "\n",
        "result_with_dp = {}\n",
        "for word in result:\n",
        "  noised_count = result[word] + np.random.laplace(scale=scale)\n",
        "  if noised_count \u003e= tau:\n",
        "    result_with_dp[word] = int(noised_count)\n",
        "print(f'Discovered heavy hitters and counts with central DP:')\n",
        "print(result_with_dp)"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "collapsed_sections": [],
      "name": "private_heavy_hitters.ipynb",
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
